Abstract: When the same set of genes appears in two top-ranking
gene lists from two different studies, it is often of interest to
estimate the probability that this is a chance event. This
overlapping probability is well known to follow the hypergeometric
distribution. Usually, the lengths of the top-ranking gene lists are
assumed to be fixed by a pre-set criterion on, e.g., the p-value of
a t-test. We investigate how the overlapping probability changes
with the gene selection criterion, or simply, with the length of
the top-ranking gene lists. We conclude that the overlapping
probability is indeed a function of the gene list length, and that
its statistical significance should be quoted in the context of
the gene selection criterion.

I. INTRODUCTION 

One of the most common tasks in microarray analysis is
to identify a list of genes that are differentially expressed
under two conditions, such as being affected by a disease
vs. normal, before vs. after a medical treatment, and one vs.
another disease subtype. The number of genes on the top-ranking
list is usually much smaller than the total number of
genes on the chip, n. If the same type of microarray chip is
used for two different studies (e.g. disease-A vs. control, and
disease-B vs. control), two differentially expressed gene lists
can be obtained, with n_1 and n_2 genes. Researchers often
find that the same genes appear in both lists and hypothesize
that these common genes are involved in the etiology of both
diseases.

However, for such a hypothesis to be convincing, one
has to first estimate the probability of genes overlapping
by chance alone. In other words, if two lists of genes are
selected randomly out of n genes, we would like to calculate
the probability of m genes being in common in the two lists, with
the lengths of the two lists being n_1 and n_2. This overlapping
probability is known to follow the hypergeometric
distribution.^1 The name hypergeometric distribution was first used
in [1], and was popularized by its role in Fisher's exact test
[2].

In microarray analysis, overlapping probability and the
hypergeometric distribution mainly appear in testing the
enrichment of genes in a certain functional category [3], [4], [5],
[6], [7], [8], [9], [10]. In this application, the first list is
the top-ranking differentially expressed genes, and a gene
selection process is involved. The second list is nevertheless
given: n_2 genes are known to be in a pathway, a member
of a protein family, described by a gene ontology term, etc.
One asks for the chance probability that m out of the
n_1 selected genes are in a given pathway or protein family,
or are describable by a gene ontology term. Whether n_2 is fixed
is the main difference between that application and ours.

^1 Despite a certain similarity, this problem is not the birthday problem, i.e., the
probability for two people in a room to have the same birthday.

When a different gene selection criterion is used, the
number of genes in the two top-ranking lists of the two studies
(n_1 and n_2) will also change. Because the stringency of a
gene selection criterion is always adjustable and to some
extent arbitrary, we would like to examine whether these
changes affect the overlapping probability. At the two
extremes, very small n_1 = n_2 ≈ 1 and very large
n_1 = n_2 = n, it is clear that the number of overlapping
genes is m = 0 and m = n, respectively. These m values appear
essentially 100% of the time, so the corresponding p-value is
equal to 1, i.e., not significant. For intermediate n_1 ≈ n_2
values, it is not clear what the overlapping probability and
significance will be, and this is the topic of this abstract.

II. HYPERGEOMETRIC DISTRIBUTION AND 
OVERLAPPING P-VALUES 

Given integers n, n_1, n_2, m (with max(n_1, n_2) ≤ n and m ≤
min(n_1, n_2)), the hypergeometric distribution is defined as

$$P(m) = \frac{C(n_1, m)\, C(n - n_1, n_2 - m)}{C(n, n_2)},$$

where C(n, m) is the number of possibilities of choosing m
objects out of n objects: C(n, m) = n!/[m!(n - m)!].

When n_1 genes are randomly chosen from the total of n
genes, and another random sampling leads to n_2 genes, the
probability that the two lists of genes have m in common
is exactly the hypergeometric probability P(m). This can be
proven by the following steps: 1) The total number of possible
choices for the two lists of genes is C(n, n_1) · C(n, n_2).
2) There are C(n, n_1) possibilities for choosing the first list.
3) Among the n_1 genes in the first list, there are C(n_1, m)
possibilities for choosing m genes to be in common with
the second list. 4) In the second list, besides the m genes
that are in common with the first list, the remaining n_2 - m
genes are chosen among the n - n_1 "leftover" genes not in
the first list, thus C(n - n_1, n_2 - m) possibilities. P(m)
is simply (#2 x #3 x #4) / #1. Note that n_1 and n_2 can be
switched without changing the P(m) value.
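
The four counting steps above can be checked numerically. The following is a minimal sketch in Python, not the authors' code: the function name `overlap_pmf` is a hypothetical choice, and only the standard-library `math.comb` is used. It multiplies the counts from steps 2 to 4, divides by the count from step 1, and then checks that the probabilities sum to one and that swapping n_1 and n_2 leaves P(m) unchanged.

```python
from math import comb


def overlap_pmf(n, n1, n2, m):
    """Probability that two random gene lists of lengths n1 and n2,
    drawn independently from the same n genes, share exactly m genes.

    Hypothetical helper name; it follows the four counting steps in the text.
    """
    if m < 0 or m > min(n1, n2):
        return 0.0
    all_pairs = comb(n, n1) * comb(n, n2)   # step 1: all choices of the two lists
    first_list = comb(n, n1)                # step 2: choices for the first list
    shared = comb(n1, m)                    # step 3: which m genes are shared
    remainder = comb(n - n1, n2 - m)        # step 4: rest of the second list
    return first_list * shared * remainder / all_pairs


if __name__ == "__main__":
    n, n1, n2 = 20, 5, 7                    # small illustrative values
    probs = [overlap_pmf(n, n1, n2, m) for m in range(min(n1, n2) + 1)]
    print(sum(probs))                       # should be 1 up to rounding
    # Swapping the two list lengths should give the same probability.
    print(overlap_pmf(n, n1, n2, 2), overlap_pmf(n, n2, n1, 2))
```

Exact integer arithmetic is used for the binomial coefficients, so the only rounding happens in the final division.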

Fig. 1. First column: proportion of overlapping genes between two top-ranking
gene lists for a pair of studies (m/n_1) as a function of the gene list
length n_1 (= n_2). Top: gene ranking by t-test; bottom: gene
ranking by logistic regression. The overlapping proportion for two randomly
shuffled lists is shown as crosses, and the line m/n_1 = n_1/n is marked.
Second column: observed number of overlapping genes (m) minus the
expected number of overlapping genes (n_1^2/n).

It is usually more interesting to calculate the sum of P(m)
over all m's equal to or larger than the observed value (i.e., the
p-value):

$$\text{p-value} = \sum_{k=m}^{\min(n_1,n_2)} P(k) = \sum_{k=0}^{\min(n_1,n_2)} P(k) - \sum_{k=0}^{m-1} P(k)$$

In the statistical package R (http://www.r-project.org), there are
at least two ways to calculate the overlapping p-value. The
first is to use the cumulative distribution function of the
hypergeometric distribution, phyper(m, n_1, n - n_1, n_2): p-value =
phyper(min(n_1, n_2), n_1, n - n_1, n_2) - phyper(m - 1, n_1,
n - n_1, n_2) if m > 0, and p-value = 1 if m = 0. The second
method is to use the p-value from Fisher's exact test on
the 2-by-2 table that cross-classifies the n genes by membership
in the first and in the second list.
The two approaches lead to identical results.
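
As a rough illustration of the p-value calculation, the sketch below computes the upper-tail sum directly and as a difference of cumulative sums, mirroring the phyper-based expression. It is written in Python with the standard library only; all function names are hypothetical. The 2-by-2 layout returned by `contingency_table` is an assumption (rows: in or not in the first list; columns: in or not in the second list), since the table itself is not reproduced here. For a table of this form, the tail sum corresponds to a one-sided Fisher's exact test (in R, `fisher.test` with `alternative = "greater"`).

```python
from math import comb


def overlap_pmf(n, n1, n2, k):
    """Hypergeometric probability of exactly k shared genes."""
    if k < 0 or k > min(n1, n2):
        return 0.0
    return comb(n1, k) * comb(n - n1, n2 - k) / comb(n, n2)


def overlap_p_value(n, n1, n2, m):
    """Upper-tail sum: probability of m or more shared genes by chance."""
    tail = sum(overlap_pmf(n, n1, n2, k) for k in range(m, min(n1, n2) + 1))
    return min(1.0, tail)  # guard against rounding slightly above 1


def overlap_p_value_from_cdf(n, n1, n2, m):
    """Same p-value written as a difference of two cumulative sums."""
    if m == 0:
        return 1.0

    def cdf(q):
        return sum(overlap_pmf(n, n1, n2, k) for k in range(0, q + 1))

    return cdf(min(n1, n2)) - cdf(m - 1)


def contingency_table(n, n1, n2, m):
    """Assumed 2-by-2 layout: rows = in / not in list 1,
    columns = in / not in list 2."""
    return [[m, n1 - m],
            [n2 - m, n - n1 - n2 + m]]


if __name__ == "__main__":
    n, n1, n2, m = 20, 5, 7, 2              # small illustrative values
    print(overlap_p_value(n, n1, n2, m))
    print(overlap_p_value_from_cdf(n, n1, n2, m))
    print(contingency_table(n, n1, n2, m))
```

The four cells of the table always sum to n, which is a quick sanity check on the inputs.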

III. PROPORTION OF OVERLAPPING GENES IN A
COLLECTION OF MICROARRAY DATASETS

In the hypergeometric distribution, the number of overlapping
elements m is a variable independent of the list
lengths n_1 and n_2. In order to get a rough idea of how m
changes with the list lengths, we use three real microarray
datasets. These studies concern three autoimmune diseases:
rheumatoid arthritis (RA), systemic lupus erythematosus
(SLE), and psoriatic arthritis (PsA), described in detail
in [11], [12], [13]. The numbers of controls (C) and patients
(P) in these three datasets are (C=39, P=46), (C=41,
P=81), and (C=19, P=19), respectively. The total number
of genes/probe-sets is n = 22283, and the expression levels
are log transformed. Genes are ranked by their degree of
differential expression, which can be measured by various
tests or models, such as the t-test and logistic regression.

Fig. 2. Overlapping significance as measured by -log10(p-value), where the
p-value is obtained from the hypergeometric distribution, as a function of
n_1 (= n_2), the number of genes in the top-ranking gene lists. The R
program reports the p-value as zero whenever it is lower than 2.2 × 10^{-16},
and we use a ceiling of 15.65758 = -log10(2.2 × 10^{-16}) in the plot. Six
lines are shown for three study pairs (RA-SLE, SLE-PsA, RA-PsA) and two
tests/models (t-test and logistic regression). The corresponding overlapping
significance for two randomly shuffled lists is also shown (indicated by crosses).

For any pair of studies, with a fixed length of the
top-ranking gene lists n_1 (= n_2), one can count the number of
overlapping genes m and the proportion m/n_1 (= m/n_2).
Fig. 1 (left column) shows this proportion as a function of
n_1 (= n_2) for three study pairs (RA-SLE, SLE-PsA, RA-PsA)
as well as for two ranking methods (t-test and logistic
regression). The corresponding overlapping proportion of two
randomly shuffled lists is also indicated in Fig. 1 as crosses.

When n_1 (= n_2) is small, m is more likely to be zero, so
the proportion is also zero. When n_1 (= n_2) approaches the
total number of genes, n, all genes are overlapping genes,
and the proportion is 1. Fig. 1 indeed shows these trends
at the two extreme points. In order to check the behavior
in between, we draw a reference line in Fig. 1 (left column) that
assumes a linear relationship between m/n_1 and n_1/n. Most
of the points in Fig. 1 are above this line, and the overlapping
proportion of two random lists is exactly on this line.

To have an idea of the absolute number of common
genes beyond what is expected by random chance, Fig. 1 (right
column) plots the observed m minus the expected m_exp =
n_1^2/n (= n_2^2/n) as a function of n_1 (= n_2). The maximum
difference between the observed and expected is reached
between n_1 = 5000 and n_1 = 10000. The difference between the
observed and expected m's can be as much as 600-800.
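
The expected overlap and the behavior of randomly shuffled lists can be illustrated with a short simulation. This is a sketch under stated assumptions, in Python with the standard library only: the function names are hypothetical, and taking the top n_1 genes of a random shuffle is modeled as drawing a random sample of n_1 genes. It does not use the RA, SLE or PsA data.

```python
import random


def expected_overlap(n, n1, n2):
    """Expected number of shared genes for two random lists: n1 * n2 / n."""
    return n1 * n2 / n


def random_overlap(n, n1, n2, rng):
    """Overlap between the top n1 and top n2 genes of two independent
    random shuffles of the same n genes."""
    genes = list(range(n))
    first = set(rng.sample(genes, n1))
    second = set(rng.sample(genes, n2))
    return len(first & second)


def mean_random_overlap(n, n1, n2, trials=200, seed=0):
    """Average overlap over repeated random shuffles (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(random_overlap(n, n1, n2, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    # Chip size from the text, with an illustrative list length.
    print(expected_overlap(22283, 5000, 5000))
    # Smaller illustrative pool so the simulation runs quickly.
    print(mean_random_overlap(2000, 400, 400), expected_overlap(2000, 400, 400))
```

With n_1 = n_2, the expected proportion m_exp/n_1 equals n_1/n, which is the reference line described above.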

IV. OVERLAPPING SIGNIFICANCE 

The overlapping p-value corresponding to the m counts
plotted in Fig. 1 was calculated by the hypergeometric
distribution, and is shown in Fig. 2. The y-axis is -log10(p-value), and
the x-axis is n_1 (= n_2). Six lines are shown for three
comparisons (RA-SLE, SLE-PsA, RA-PsA) and two measurements
of the differential expression (t-test and logistic regression).
Zero p-values are converted to 2.2 × 10^{-16}, which is the
minimum value reported by the R program. Fig. 2 shows that,
apart from the two ends (m = n_1 = n_2 = 0 and m = n_1 =
n_2 = n) where the p-value is 1, the overlapping significance
quickly increases with the length of the top-ranking gene list
n_1 (= n_2), and can be extremely significant when a large
number of genes are kept in the two lists for comparison.

Fig. 3. The test significance (-log10(p-value)) from t-tests of n = 22283
genes sorted by the averaged expression level (log-transformed) across all
245 samples in 3 studies (RA, SLE, PsA). The three t-tests are for RA vs.
control, SLE vs. control, and PsA vs. control.

This result confirms our previous suspicion that overlapping
significance is a function of the gene list lengths. If the
selection of n_1, n_2 is arbitrary, the overlapping significance
thus calculated is also arbitrary. It is not surprising that
overlapping significance may keep increasing (or, the p-value
keep decreasing) with the increase of n_1 (= n_2), because a p-value
in general depends on the sample size. When a signal is real
(true positive), the p-value will monotonically decrease with the
sample size. In contrast, if a true signal is absent, the
sample size does not affect the conclusion. As can be seen
in Fig. 2, the overlapping significance for two random lists
does not really change with n_1 (= n_2).
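
The dependence of the overlapping significance on the list length can be explored on synthetic data. The sketch below is illustrative only and uses Python's standard library. The simulated rankings (a shared per-gene component plus independent noise in each study) are an assumed toy model, not the authors' datasets, and all function names are hypothetical. The ceiling of -log10(2.2 × 10^{-16}) follows the convention used for Fig. 2.

```python
import math
import random
from math import comb

P_FLOOR = 2.2e-16  # smallest p-value reported by R, per the text


def overlap_p_value(n, n1, n2, m):
    """Hypergeometric upper tail, summed exactly in integers."""
    tail = sum(comb(n1, k) * comb(n - n1, n2 - k)
               for k in range(m, min(n1, n2) + 1))
    return min(1.0, tail / comb(n, n2))


def significance(n, n1, n2, m):
    """-log10(p-value), capped at -log10(2.2e-16)."""
    p = max(overlap_p_value(n, n1, n2, m), P_FLOOR)
    return max(0.0, -math.log10(p))


def simulated_rankings(n, shared_sd, rng):
    """Toy model (assumption): per-gene scores in two studies share a
    common component; genes are ranked by absolute score."""
    shared = [rng.gauss(0.0, shared_sd) for _ in range(n)]
    score_a = [s + rng.gauss(0.0, 1.0) for s in shared]
    score_b = [s + rng.gauss(0.0, 1.0) for s in shared]

    def rank(scores):
        return sorted(range(n), key=lambda g: -abs(scores[g]))

    return rank(score_a), rank(score_b)


def significance_curve(rank_a, rank_b, lengths):
    """(list length, observed overlap m, significance) for each length."""
    n = len(rank_a)
    curve = []
    for length in lengths:
        m = len(set(rank_a[:length]) & set(rank_b[:length]))
        curve.append((length, m, significance(n, length, length, m)))
    return curve


if __name__ == "__main__":
    rng = random.Random(0)
    rank_a, rank_b = simulated_rankings(1000, shared_sd=1.0, rng=rng)
    for length, m, s in significance_curve(rank_a, rank_b,
                                           [10, 50, 100, 250, 500, 1000]):
        print(length, m, round(s, 2))
```

Setting `shared_sd=0.0` removes the shared component, which gives a rough counterpart to the two randomly shuffled lists.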

One may argue that it is unlikely that the top 5000
genes would be considered differentially expressed, because by a typical
selection criterion (e.g. p-value of t-test smaller than 0.01,
with or without multiple testing correction), the number of
genes selected is fewer than a few hundred. However, as can
be seen in Fig. 2, even in the range of 10-500, the overlapping
p-value changes dramatically.

This pitfall of gene-list-length dependence of overlapping
p-values has perhaps not been noticed before because in other
applications of the hypergeometric distribution for calculating
overlapping probability, the length of the second list n_2
is fixed, for example, in the study of overrepresentation
of genes in a certain pathway. The number of overlapping
genes m is then constrained from above by min(n_1, n_2)
even though the length of the first list, n_1, might increase by
relaxing the gene selection criterion.

V. THE EFFECTS OF UNEXPRESSED GENES 

There are many genes/probe-sets on the microarray chip
that do not register much signal. Since these low-expressed
genes are lowly expressed in both control and patient samples,
they usually do not appear in the top-ranking differentially
expressed gene list. Fig. 3 shows -log10(p-value) of
each gene for the 3 t-tests, sorted by average expression
(log-transformed) across all 245 samples in the 3 datasets (for both
cases and controls). Although we cannot use the average
expression level to predict the degree of differential expression,
there is a general trend for low-expressed genes to rank lower
in the differentially expressed list, as seen in Fig. 3.

Fig. 4. Several measures of overlapping genes between a pair of studies as
a function of the number of genes included in the top-ranking list, for the
reduced dataset with 15283 genes. First column: proportion of overlapping
genes (m/n_1); second column: number of observed overlapping genes minus
the number expected (m - n_1^2/15283); third column: -log10(p-value)
by the hypergeometric distribution. First row is for lists ranked by
t-test result, and second row is for lists ranked by logistic regression.

We removed the 7000 genes with the lowest overall expression
across all samples, leaving n = 15283 genes. Figs. 1 and
2 are reproduced in Fig. 4 for the dataset with a reduced
gene pool. As in Figs. 1 and 2, the observed number of
overlapping genes m is much larger than expected,
though the difference peaks at 400-600, versus 600-800 in
Fig. 1. The overlapping significance as measured by
-log10(p-value) again quickly moves up with n_1 (= n_2), as shown in
the last column of Fig. 4.
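
The filtering step can be sketched as follows. This is a hypothetical illustration in Python, not the authors' procedure: the function name `drop_low_expressed` and the data layout (a dictionary mapping each gene or probe-set identifier to its log-transformed expression values across all samples) are assumptions made for the example.

```python
def drop_low_expressed(expression, n_drop):
    """Remove the n_drop genes with the lowest mean expression across samples.

    `expression` maps a gene/probe-set identifier to a list of
    (log-transformed) expression values; this layout is an assumption.
    """
    means = {gene: sum(values) / len(values)
             for gene, values in expression.items()}
    ordered = sorted(means, key=means.get)      # lowest mean first
    dropped = set(ordered[:n_drop])
    return {gene: values for gene, values in expression.items()
            if gene not in dropped}


if __name__ == "__main__":
    # Tiny made-up example; identifiers and values are illustrative only.
    toy = {"g1": [1.0, 1.2], "g2": [5.0, 5.5], "g3": [0.2, 0.1], "g4": [3.0, 3.1]}
    print(sorted(drop_low_expressed(toy, 2)))
```

After such a filter, the reduced pool size replaces n in the hypergeometric calculation, as with n = 15283 in the text.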

The qualitative similarity between Figs. 1, 2 and Fig. 4
indicates that the presence of low-expressed genes does not
affect our conclusion.

VI. CONCLUSIONS AND FUTURE WORKS 

A. Conclusions 

Using the hypergeometric distribution to calculate the
overlapping probability between two top-ranking differentially
expressed gene lists in two studies, we have shown that the
overlapping significance depends on the stringency of the gene
selection criterion, or equivalently, the length of the gene
lists. This observation presents a problem when an overlapping
p-value is reported but the gene selection criterion is not
specified. On the other hand, the increase of the overlapping
significance with the gene list length can be an indication
that the significant overlapping of genes is a true signal.



B. Future Works 

The overlapping probability calculated here assumes the 
two top-ranking gene lists are selected from the same pool
of n genes. If the two studies are based on different chip 
platforms, the two initial gene pools are not identical, though 
there are perhaps certain common genes. We plan to derive 
the overlapping distribution for this situation. 

We also plan to study the probability of genes appearing
in three top-ranking gene lists. Although a permutation-based
approach for comparing multiple studies was proposed in [14],
there is no analytic formula available.